ABSTRACT

CyanoLyase (http://cyanolyase.genouest.org/) is a 
manually curated sequence and motif database of 
phycobilin lyases and related proteins. These 
enzymes catalyze the covalent ligation of chromo- 
phores (phycobilins) to specific binding sites of 
phycobiliproteins (PBPs). The latter constitute the 
building bricks of phycobilisomes, the major 
light-harvesting systems of cyanobacteria and red 
algae. Phycobilin lyase sequences are poorly annotated
in public databases. Sequences included in
CyanoLyase were retrieved from all available 
genomes of these organisms and a few others by 
similarity searches using biochemically characteri- 
zed enzyme sequences and then classified into 3 
clans and 32 families. Amino acid motifs were 
computed for each family using Protomata learner. 
CyanoLyase also includes BLAST and a novel 
pattern matching tool (Protomatch) that allow 
users to rapidly retrieve and annotate lyases from 
any new genome. In addition, it provides phylogenetic
analyses of all phycobilin lyase families, describes
their function, their presence/absence in
all genomes of the database (phyletic profiles) and 
predicts the chromophorylation of PBPs in each 
strain. The site also includes a thorough bibliog- 
raphy about phycobilin lyases and genomes 
included in the database. This resource should be 
useful to scientists and companies interested in 
natural or artificial PBPs, which have a number of 
biotechnological applications, notably as fluores- 
cent markers. 



INTRODUCTION 

Oxygenic phototrophic prokaryotes (i.e. cyanobacteria) 
share with the eukaryotic classes Rhodophyta (i.e. red 
algae) and Cryptophyta the presence of phycobiliproteins 
(PBPs), which are water-soluble proteins chromophory- 
lated with brilliantly colored, linear-tetrapyrrolic 
pigments, called phycobilins (1). In red algae and most 
cyanobacterial species, different PBP types are assembled 
to form phycobilisomes (PBS), the major light-harvesting 
systems of these organisms, which are constituted of 
a central core surrounded by a number of radiating 
rods [usually six in cyanobacteria (2,3)]. Although the 
main antenna system of Cryptophyta is a membrane- 
intrinsic Lhc-type complex, like in all other photosynthetic 
eukaryotes (except red algae), cryptophytes possess a sec- 
ondary one made of tightly packed aggregates of one PBP 
type, either phycocyanin (PC) or phycoerythrin (PE), 
located in the thylakoid lumen in the proximity of photo- 
systems (4). 

Although the PBP and phycobilin composition of the 
PBS core varies little, because it is always composed of 
allophycocyanin (APC) that binds phycocyanobilin (PCB) 
as its only chromophore, the structure of PBS rods is ex- 
tremely variable between groups, and even within a given 
genus (3). In marine Synechococcus, for instance, six dif- 
ferent pigment types have been described so far (5), based 
on the various PBP composition of their PBS rods. 
Indeed, the latter can comprise one to three of four 
possible PBP types: PC, PE-I or PE-II and phycoerythro- 
cyanin [PEC, so far only found in freshwater species; see 
(3)]. Furthermore, the phycobilin composition of each in- 
dividual PBP itself varies, because it may bind one to three 
types of the four different possible phycobilin types: PCB, 
phycoerythrobilin (PEB), phycourobilin (PUB) and 
phycoviolobilin (PVB), which are isomers with distinct 
spectral properties. PBPs generally consist of two
subunits, α and β, organized into hexamers, and each
subunit has either one (α- or β-APC, α-PC and α-PEC),
two (α-PEI and β-PC) or three chromophore binding
cysteinyl sites (β-PEI, α-PEII and β-PEII). Given this
complexity, the phycobilin lyases, that is the enzymes 
that catalyze the ligation of chromophores to PBPs, con- 
stitute a particularly wide and diversified group of proteins 
(6,7). Among those which have been biochemically 
characterized, most are highly specific in vivo as they can 
ligate only one phycobilin type at one particular PBP 
binding site. However, CpcS (a member of the S/U clan) 
can bind either PCB or PEB to one specific site, Cys-82 
(consensus numbering), of a variety of PBPs (α- and
β-APC and PE, β-PC and PEC) and is therefore more
universal (8). Furthermore, some enzymes of the E/F 
clan are bifunctional, because they can both bind a 
chromophore (either PCB for PecE/F or PEB for RpcG) 
to α-PC and change its chemical configuration into
another isomer [i.e. PVB or PUB, respectively; see (9-13)]. 

Here, we describe CyanoLyase, a sequence and motif 
database dedicated to the annotation of phycobilin 
lyases and related proteins. Indeed, these enzymes are 
often poorly annotated in public databases, especially se- 
quences coming from genome projects. Given the fact that 
PBPs have a growing number of biotechnological and bio- 
medical applications [see, e.g. (14) and references therein], 
this resource should be very useful to all scientists and 
companies interested in natural or artificial PBPs. 
Furthermore, the knowledge of the phycobilin lyase 
content of any given cyanobacterial strain can be used 
to predict the pigmentation of its PBS, even if the latter 
was previously unknown, and therefore, CyanoLyase provides
a list of the known and predicted chromophores at
all binding positions of PBPs for most strains of the 
database. CyanoLyase also contains bioinformatic tools, 
BLAST (15) and a new pattern analysis suite Protomata 
[see (16) and http://tools.genouest.org/tools/protomata], 
that allow users to rapidly retrieve and annotate all 
lyases present in any new genome using a whole 
proteome file in Fasta format. In addition, it provides 
tables specifying the function of lyases and phyletic 
profiles [i.e. patterns of presence of orthologs in a set of 
genomes; see, e.g. (17)] allowing the user to determine the 
co-occurrence of lyase genes in the different strains of the 
database. 



DATA COLLECTION AND CURATION 

The CyanoLyase database is mainly composed of se- 
quences of characterized or presumed phycobilin lyases 
retrieved from genomes of cyanobacteria, red algae or 
cryptophytes. However, in view of forthcoming evolution- 
ary studies of this interesting enzyme group, the database 
also comprises sequences of a number of phylogenetically 
related proteins, such as NblB that is involved in PBS 
degradation during nitrogen starvation (18) or IaiH 
involved in iron-sulfur cluster biosynthesis (19), as well 
as other proteins with no characterized function to date. 
At the time of writing, CyanoLyase contained 954
sequences of phycobilin lyases and related proteins,
coming from 84 genomes (mainly cyanobacteria). These 
sequences have been classified into three main clans [i.e. 
proteins sharing a common 3D structure; see, e.g. (20)] 
and 30 different families [i.e. groups of orthologous se- 
quences; see, e.g. (21)], a modification and extension of 
the previous classification proposed by Schluchter et al. 
(7). For members of the S/U and E/F clans, the 3D struc- 
ture was predicted using the Protein Fold Recognition 
Server Phyre 2 (22), while there is so far no structure 
that fits members of the T clan in public databases. The 
E/F clan was further subdivided into two subclans, based 
on both the phylogeny and the fact that enzymes belonging
to E/F subclan 1 form either heterodimers or fusion
proteins, whereas members of E/F subclan 2 apparently 
do not. Each family in our classification gathers proteins 
that we assume to have the same biological function. Some 
families were further divided into subfamilies, based on 
phylogenetic analyses, which for instance often split 
apart marine picocyanobacteria sequences from other 
cyanobacteria (7). 

To build the dataset included in the CyanoLyase 
database, an initial set of biochemically characterized 
phycobilin lyases was compiled from an extensive litera- 
ture survey and similarity searches were conducted using 
BLAST to retrieve highly homologous sequences in all 
cyanobacterial genomes available in RefSeq. After 
attributing these sequences to a given family and/or sub- 
family, conserved amino acid motifs were computed using 
Protomata learner. Then, using both BLAST and 
Protomatch (see Tools section below for details about 
Protomata learner and Protomatch), more distantly 
related sequences were identified in public databanks 
(e.g. RefSeq protein) and added to existing or newly 
created families. 

Sequences stored in the database are tagged as 'sure' or 
'unsure'. 'Unsure' sequences are sequences that, based on 
similarity or the presence of a conserved motif, have been 
affiliated to a given family, but for which matching scores 
are too low to ascertain that they have the same function 
as other members of this family (e.g. CMR092C from 
Cyanidioschyzon merolae or the two paralogous sequences 
AM1_4215 and AM1_C0217 from Acaryochloris marina 
have been attributed to the CpcS family, but tagged 
'unsure' because the identity to other members of this 
well-conserved family is lower than 53%). 'Unsure' se- 
quences are indicated in italics in the family description 
pages and are not used for motif and phylogeny 
inferences. 
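
The tagging rule described above can be illustrated with a simple threshold test. This is only a minimal sketch: curation in CyanoLyase was manual, the function name `tag_sequence` is hypothetical, and the 53% default is the identity value quoted above for the CpcS family, which would not necessarily apply to other families.

```python
def tag_sequence(percent_identity: float, cutoff: float = 53.0) -> str:
    """Return 'sure' or 'unsure' for a candidate family member.

    Hypothetical helper: `cutoff` defaults to the 53% identity quoted
    in the text for the CpcS family; other families would need their
    own, manually chosen, value.
    """
    if not 0.0 <= percent_identity <= 100.0:
        raise ValueError("percent_identity must be between 0 and 100")
    # Sequences below the cutoff are kept in the family but flagged.
    return "sure" if percent_identity >= cutoff else "unsure"


# Made-up identity values, for illustration only.
candidates = {"seq_A": 71.2, "seq_B": 48.5}
tags = {name: tag_sequence(pid) for name, pid in candidates.items()}
print(tags)
```

In such a scheme, only the entries tagged 'sure' would be passed on to motif and phylogeny inference.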

All the sequences stored in CyanoLyase were manually 
curated, and modifications were made for some open 
reading frames (ORFs) that were not accurately defined 
in public databanks (GenBank). For instance, some 
protein sequences were missing a few residues at the 
N-terminus and were therefore extended, whereas others 
seemingly had too long N-termini due to a misplaced start 
codon and were then shortened. Also, two sequences 
(CpcSIII in Cylindrospermopsis raciborskii CS-505 and in 
Raphidiopsis brookii D9) were found to have a bacterial 
group II intron insertion, which led to the prediction of 
two independent ORFs by the annotation algorithms. In 
this case, the intronic region was removed and the two
ORFs fused. Whenever such an alteration was made to a 
sequence, this was reported in the corresponding remark 
field. 

Because most of the sequences of CyanoLyase come 
from GenBank, NCBI record IDs are included in the 
database, when available, to keep track of the origin of 
the data. 



DATA ACCESS 

Free public access to the CyanoLyase database, tools
and other features is available at
http://cyanolyase.genouest.org. A brief description of this group of
enzymes, some database statistics and direct access to
bioinformatic tools are available from the 'Home Page'.
Several browsing methods are available to access the 
information contained in the database: curated data, 
applications and additional features are arranged under 
independent pull-down menus, the main ones called 
'Genomes', 'Families', 'Functions', 'Pigmentation', 
'Blast', 'Protomatch', 'Phyletic profile' and 'References'. 

The 'Genomes' page lists all the genomes where at least 
one 'true' phycobilin lyase has been identified (i.e. not only 
a related sequence) and provides some information about 
strain taxonomy, classification, environment and the 
sequencing center and status. This list, like all others in 
CyanoLyase, is sortable and filterable to ease the naviga- 
tion. It can also be exported in various formats (csv and 
pdf). Clicking on a genome displays information about 
this genome, links to its RefSeq record (if available), 
some bibliographic references and the list of phycobilin 
lyase or related sequences that were found in this genome. 

The 'Families' page displays the classification of sequences
included in the database, which are divided into
clans, subclans, families and subfamilies. For each level,
a hyperlink leads to a brief description, some biblio- 
graphic references and the list of sequences belonging to 
this group and, for most of the two lower levels, additional 
links give access to amino acid motifs. Each sequence 
stored in CyanoLyase has a dedicated page with details 
about the genome where the sequence comes from and the 
family in which it is classified. 

Because CyanoLyase keeps track of NCBI record IDs 
(GenBank, RefSeq) of sequences and genomes when avail- 
able, it offers users the possibility to display the genomic 
context of each phycobilin lyase gene (Figure 1) using the 
NCBI Sequence Viewer (http://www.ncbi.nlm.nih.gov/projects/sviewer/).
Using this tool, the user gets access to
the genomic organization around lyase genes. This option 
is of particular interest because the latter genes are fre- 
quently organized in clusters with other genes involved 
in PBS biosynthesis and regulation (5). 

TOOLS 

Some bioinformatic tools are directly available on the 
CyanoLyase website to perform analyses of new sequences
and find novel members of the phycobilin lyase family.
A BLAST (version 2.2.26+) form allows users to search
for sequence similarity of query sequences in CyanoLyase
databanks. These databanks (which can be selected using a
drop-down menu) comprise not only all individual phycobilin
lyase and related protein families but also the whole
proteomes of all cyanobacteria and red algae included in 
the database. BLAST results can be downloaded in various 
formats. Sequences already recorded in CyanoLyase and 
having an associated GenBank ID are highlighted in the 
result page by the presence of the abbreviation 'CL' (for 
CyanoLyase), just before the ORF ID in the BLAST 
result. 

CyanoLyase also gives access to an original motif dis- 
covery and matching tool suite: Protomata (version 2.0; 
http://tools.genouest.org/tools/protomata). This software 
allows the user to discover motifs in sets of related se- 
quences, focusing only on most conserved regions, repre- 
sented as blocks with a letter size proportional to the 
amino acid frequency at any position (Figure 2). Most 
often, multiple blocks are detected for each dataset, and 
each block can be found in all the sequences or only a 
subset of them. With this tool, it is possible to detect the 
regions common to all the sequences of a protein family, 
but also regions shared only by some sequences that may 
constitute a subfamily. Using Protomata learner, a motif 
has been generated for most lyase families and subfamilies.
These motifs are available online and can be used directly 
from the web interface to search for motif matching on 
new protein sequences using Protomatch. 
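
The block display described above, in which letter size is proportional to amino acid frequency, rests on a per-position frequency count. The sketch below shows only that counting step, under simplifying assumptions: it is not the Protomata learner algorithm, the function name `column_frequencies` is hypothetical, and the toy block is made up rather than taken from real lyase sequences.

```python
from collections import Counter


def column_frequencies(block: list[str]) -> list[dict[str, float]]:
    """Per-position amino acid frequencies for an ungapped aligned block."""
    if not block:
        return []
    width = len(block[0])
    if any(len(seq) != width for seq in block):
        raise ValueError("all sequences in a block must have the same length")
    freqs = []
    for i in range(width):
        # Count residues in column i, then normalise by the number of sequences.
        counts = Counter(seq[i] for seq in block)
        freqs.append({aa: n / len(block) for aa, n in counts.items()})
    return freqs


# Toy block of four made-up 5-residue segments (not real lyase data).
toy_block = ["HCQGN", "HCQGN", "HCQAN", "HCEGN"]
for position, freq in enumerate(column_frequencies(toy_block), start=1):
    print(position, freq)
```

A fully conserved position yields a single residue with frequency 1.0, which would be drawn at full height in a logo-style display.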

Using both BLAST and Protomatch allows users to 
rapidly retrieve all the putative phycobilin lyases or 
related sequences present in any new genome within a 
few minutes, directly in a web browser. The same tools 
can also be used to search for such sequences in public 
databanks such as RefSeq protein or NR. This type of 
search will be done on a regular basis by CyanoLyase 
authors to keep the database updated. 

In addition, a phylogeny program has been integrated 
into CyanoLyase. It performs a succession of three tasks: 
(i) using Muscle (23), it automatically generates multiple 
alignments of protein sequences for each level from 
subfamilies to clans, (ii) using trimAl (24), it eliminates 
gaps and low-quality regions and (iii) using PhyML (25),
it generates Maximum Likelihood trees. The multiple 
alignments and phylogenetic trees (under both Newick 
format and radial visualization) are available in the 
'Phylogeny' boxes that appear in the description of each 
group of sequences at all levels of the classification. 
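
The three-step succession above can be sketched as a list of command lines. This is an illustration under stated assumptions, not the CyanoLyase implementation: the options actually used on the server are not given here, the flags shown follow MUSCLE v3-style syntax (`-in`/`-out`), trimAl's `-gappyout` mode and PhyML's amino acid data type, and the file names and the function `build_pipeline_commands` are hypothetical.

```python
from pathlib import Path


def build_pipeline_commands(fasta: str) -> list[list[str]]:
    """Return the three command lines: align, trim, infer tree.

    Flags are assumptions (MUSCLE v3-style -in/-out, trimAl -gappyout,
    PhyML on amino acid data); the article does not state the options used.
    """
    stem = Path(fasta).with_suffix("")
    aligned = f"{stem}.aln.fasta"
    trimmed = f"{stem}.trimmed.phy"
    return [
        # (i) multiple alignment of the protein sequences
        ["muscle", "-in", fasta, "-out", aligned],
        # (ii) removal of gappy, low-quality columns, written as PHYLIP
        ["trimal", "-in", aligned, "-out", trimmed, "-phylip", "-gappyout"],
        # (iii) maximum likelihood tree from the trimmed alignment
        ["phyml", "-i", trimmed, "-d", "aa"],
    ]


for command in build_pipeline_commands("CpcS.fasta"):
    print(" ".join(command))
    # To actually run a step: subprocess.run(command, check=True)
```

Building the commands separately from running them keeps the sketch testable on a machine where the three programs are not installed.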



ADDITIONAL FEATURES 

The 'Functions' menu summarizes in a table the
characterized or predicted functions of phycobilin lyases 
in terms of the bound chromophore (PCB or PEB), en- 
zymatic activity (lyase or lyase-isomerase), substrate (apo- 
protein targeted by the enzyme) and binding sites (cysteine 
positions at which phycobilins are bound). These data 
refer to literature data listed in the 'References' menu 
(see below). It is important to note that phycobilin 
lyases often have a lower specificity in vitro than in vivo 
(9,26), but we chose to display their in vivo properties. 
Figure 1. Use of the NCBI Sequence Viewer to visualize the genomic context of lyase genes in the genome of Synechococcus sp. CC9311. The upper track
shows the region at a low zoom level and the middle track at a higher zoom level; both show (in red) all ORFs with their IDs (or gene
names) in the RefSeq record. The bottom track displays (in black) genes present in the CyanoLyase database and their corresponding name.

Figure 2. Example of motifs created by Protomata learner for members of the E/F subclan 1. Motifs for the PecE (A) and PecF (C) subunits of the
PCB:Cys-84 α-PEC lyase-isomerase PecE/F (12) and for the corresponding N-terminus (B) and C-terminus (D) of the closely related fusion protein
RpcG, a PEB:Cys-84 α-PC lyase-isomerase (9). Note the similarity of motifs independently found by Protomata learner for these two sets of protein
sequences, and in particular the presence of the well-conserved motifs YyaAWWL, biochemically characterized as essential for the lyase activity of
PecE (13) and nHCQGn, conferring its isomerase activity to PecF (13), in the N- and C-terminus of RpcG, respectively.



The 'Pigmentation' page displays the PBP content for 
each strain listed in the genome page and the chromo- 
phores found at the different binding sites of PBP 
subunits. The latter information was either determined 
biochemically (bold letters) or predicted by CyanoLyase 
authors (normal letters), based on the presence of specific 
phycobilin lyases in the genome. 
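
The principle of predicting chromophores from lyase content can be sketched as a small rule table. The sketch is deliberately restricted to the two lyase-isomerases described in this article (PecE/F and RpcG); it is not the set of rules used by the CyanoLyase authors, and the names `RULES` and `predict_chromophores` are hypothetical.

```python
# Rules restricted to the two lyase-isomerases described in this article:
# PecE/F attaches PCB to Cys-84 of alpha-PEC and isomerizes it to PVB;
# RpcG attaches PEB to Cys-84 of alpha-PC and isomerizes it to PUB.
RULES = [
    # (required genes, PBP subunit, binding site, predicted chromophore)
    ({"PecE", "PecF"}, "alpha-PEC", "Cys-84", "PVB"),
    ({"RpcG"}, "alpha-PC", "Cys-84", "PUB"),
]


def predict_chromophores(lyases: set[str]) -> dict[tuple[str, str], str]:
    """Map (subunit, site) to a predicted chromophore for one genome."""
    predictions = {}
    for required, subunit, site, bilin in RULES:
        # A rule fires only if every required lyase gene is present.
        if required <= lyases:
            predictions[(subunit, site)] = bilin
    return predictions


# Hypothetical lyase content of a strain, for illustration only.
print(predict_chromophores({"RpcG", "CpcS"}))
```

A genome carrying only one of the two PecE/F subunits yields no prediction for α-PEC in this sketch, since both genes are required by the rule.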



The 'Phyletic profiles' page shows the distribution of 
phycobilin lyases or related proteins in the different 
genomes of the CyanoLyase database. It allows the user 
to easily spot sequences that are present in a comparable 
set of genomes, and this is particularly useful to predict 
potential heterodimers. For instance, this tool allowed us 
to find that marine Synechococcus spp. likely possess a 
CpcU-like lyase (that we called CpcU-II), which likely
forms a heterodimer with CpcS-II, by analogy with
CpcU-I, which forms a heterodimer with CpcS-I in a 
number of freshwater cyanobacteria (7,27). In contrast, 
CpcS-III does not seem to form heterodimers (8). 
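
The idea of spotting families present in a comparable set of genomes can be sketched by comparing presence/absence sets. This is a minimal illustration, assuming a Jaccard index as the similarity measure (the website does not necessarily use one); the family names come from the text, whereas the genome labels, the threshold and the function names are placeholders.

```python
from itertools import combinations

# Toy presence/absence data: family -> set of genomes where it occurs.
# Family names come from the text; genome labels are placeholders.
profiles = {
    "CpcS-II": {"g1", "g2", "g3"},
    "CpcU-II": {"g1", "g2", "g3"},
    "CpcS-III": {"g4", "g5"},
}


def jaccard(a: set[str], b: set[str]) -> float:
    """Shared genomes divided by genomes containing either family."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def co_occurring(profiles: dict[str, set[str]], threshold: float = 0.9):
    """Yield family pairs whose phyletic profiles overlap strongly."""
    ordered = sorted(profiles.items())
    for (name_a, set_a), (name_b, set_b) in combinations(ordered, 2):
        score = jaccard(set_a, set_b)
        if score >= threshold:
            yield name_a, name_b, score


print(list(co_occurring(profiles)))
```

Pairs with near-identical profiles are candidates for partners in a heterodimer, which would still need biochemical confirmation.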

The 'References' page comprises an exhaustive bibliog- 
raphy about the function of the different lyases and the 
description of genomes included in the database. These 
references are referred to, when appropriate, in the differ- 
ent sections of the CyanoLyase website. 

IMPLEMENTATION AND DATA EXPORT 

CyanoLyase was developed using the Symfony 2 PHP 
framework (http://symfony.com/). Some modules that 
were used to build the application are available under an 
open source license on GitHub (https://github.com/genouest/).
The web interface was designed with
XHTML, CSS and jQuery, and data were integrated
into a MySQL database. These data can be exported in 
Fasta format from the web interface either as individual 
sequences or as multiple sequence files, for example if one 
wishes to retrieve all members of a particular family or of 
a genome. 
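
A multiple sequence file exported in Fasta format can be read back with a few lines of code. The parser below is a minimal sketch; the header layout of real CyanoLyase exports is not specified here, so the example records and the function name `parse_fasta` are assumptions.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse multi-FASTA text into {header: sequence}."""
    parts: dict[str, list[str]] = {}
    header = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; keep the header without the '>' sign.
            header = line[1:]
            parts[header] = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            parts[header].append(line)
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Made-up records; real CyanoLyase headers may be formatted differently.
example = """>example_1 hypothetical family member
MKTAYIAKQR
QISFVKSHFS
>example_2 hypothetical family member
MSLEEQAIRL
"""
sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq))
```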

CURRENT SCOPE AND FUTURE PERSPECTIVE 

The first release of CyanoLyase provides an extensive col- 
lection of phycobilin lyases and related proteins, classified 
in clans, subclans, families and subfamilies. The website 
also gives access to bioinformatic tools to ease the anno- 
tation of these sequences in forthcoming genomes of 
PBP-containing organisms. As such, the website will be 
updated regularly as new data become available and will 
therefore be a long-term resource. Users can monitor 
directly from the web interface the latest changes that 
have occurred in the database using the corresponding 
drop-down menu.

CyanoLyase aims to be a reference resource about the 
classification of phycobilin lyases and their respective pre- 
dicted or characterized functions. Future versions could 
include sequences from metagenomic samples or viruses, 
and the same kind of resources could be built for other 
complex and poorly annotated groups of protein se- 
quences, such as the polypeptide linkers that maintain 
the PBS assembly (3). 